TD(λ) Converges with Probability 1
Author
Peter Dayan, Terrence J. Sejnowski
Abstract
The methods of temporal differences (Samuel, 1959; Sutton, 1984, 1988) allow an agent to learn accurate predictions of stationary stochastic future outcomes. The learning is effectively stochastic approximation based on samples extracted from the process generating the agent's future. Sutton (1988) proved that, for a special case of temporal differences, the expected values of the predictions converge to their correct values as larger samples are taken, and Dayan (1992) extended his proof to the general case. This article proves the stronger result that the predictions of a slightly modified form of temporal difference learning converge with probability one, and shows how to quantify the rate of convergence.
Keywords: reinforcement learning, temporal differences, Q-learning
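Since the abstract frames temporal-difference learning as stochastic approximation, a concrete update rule may help fix ideas. The following is a minimal tabular TD(λ) prediction sketch, not the paper's modified algorithm: the function name, data layout, and per-state step-size schedule are illustrative assumptions, with step sizes chosen to satisfy the Robbins-Monro conditions under which stochastic-approximation iterates converge with probability 1.

```python
import numpy as np

def td_lambda(trajectories, n_states, lam=0.9, gamma=1.0):
    """Tabular TD(lambda) prediction with accumulating eligibility traces.

    Each trajectory is a list of (s, r, s_next) transitions with integer
    states in [0, n_states); names and data layout are illustrative
    assumptions, not the paper's notation.
    """
    V = np.zeros(n_states)        # value estimates
    visits = np.zeros(n_states)   # per-state visit counts

    for trajectory in trajectories:
        z = np.zeros(n_states)    # eligibility traces, reset per episode
        for s, r, s_next in trajectory:
            visits[s] += 1.0
            delta = r + gamma * V[s_next] - V[s]   # TD error
            z[s] += 1.0                            # accumulating trace
            # Per-state Robbins-Monro step sizes: sum alpha_t = infinity,
            # sum alpha_t^2 < infinity -- the classical condition for
            # stochastic-approximation convergence with probability 1.
            V += (1.0 / (1.0 + visits)) * delta * z
            z *= gamma * lam                       # trace decay
    return V
```

Setting lam=0 recovers one-step TD(0), while lam close to 1 approaches a Monte Carlo estimate; the with-probability-1 guarantees in this literature concern decaying schedules like the one above, not constant step sizes.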
Similar Resources
On the Convergence of Stochastic Iterative Dynamic Programming Algorithms (Neural Computation)
Recent developments in the area of reinforcement learning have yielded a number of new algorithms for the prediction and control of Markovian environments. These algorithms, including the TD(λ) algorithm of Sutton (1988) and the Q-learning algorithm of Watkins (1989), can be motivated heuristically as approximations to dynamic programming (DP). In this paper we provide a rigorous proof of conve...
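To make the "approximation to DP" reading concrete, here is a sketch of Watkins-style one-step Q-learning on a tabular value function; the function signature and constant step size are illustrative assumptions.

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Watkins-style Q-learning backup on a tabular Q array.

    The sampled target r + gamma * max_a' Q[s_next, a'] stands in for the
    full dynamic-programming expectation over successor states -- the
    'approximation to DP' view. Names and step size are illustrative.
    """
    target = r + gamma * np.max(Q[s_next])   # sampled Bellman optimality backup
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```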
Pointwise Convergence of Some Multiple Ergodic Averages
We show that for every ergodic system $(X, \mu, T_1, \ldots, T_d)$ with commuting transformations, the average
$$\frac{1}{N^{d+1}} \sum_{0 \le n_1, \ldots, n_d \le N-1} \; \sum_{0 \le n \le N-1} f_1\Bigl(T_1^{n} \prod_{j=1}^{d} T_j^{n_j} x\Bigr) \, f_2\Bigl(T_2^{n} \prod_{j=1}^{d} T_j^{n_j} x\Bigr) \cdots f_d\Bigl(T_d^{n} \prod_{j=1}^{d} T_j^{n_j} x\Bigr)$$
converges for $\mu$-a.e. $x \in X$ as $N \to \infty$. If $X$ is distal, we prove that the average $\frac{1}{N} \sum_{n=0}^{N-1} f_1(T_1^{n} x) \, f_2(T_2^{n} x) \cdots f_d(T_d^{n} x)$ converges for $\mu$-a.e. $x \in X$ as $N \to \infty$...
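As an illustration of the statement's shape (my own specialization, not taken from the abstract), the case $d = 1$ collapses the first expression to a double Cesàro average of a single function, which converges by the classical Birkhoff pointwise ergodic theorem:

```latex
% The d = 1 specialization of the first average above: for a single
% ergodic transformation T_1, the double Cesaro average over (n, n_1)
% converges a.e. to the space average, by Birkhoff's theorem.
\[
  \frac{1}{N^{2}} \sum_{0 \le n_1 \le N-1} \; \sum_{0 \le n \le N-1}
      f_1\!\bigl(T_1^{\,n+n_1} x\bigr)
  \;\longrightarrow\;
  \int_X f_1 \, d\mu
  \qquad \text{for } \mu\text{-a.e. } x \in X \text{ as } N \to \infty.
\]
```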
Skorohod Representation on a given Probability Space
Let $(\Omega, \mathcal{A}, P)$ be a probability space, $S$ a metric space, $\mu$ a probability measure on the Borel $\sigma$-field of $S$, and $X_n : \Omega \to S$ an arbitrary map, $n = 1, 2, \ldots$. If $\mu$ is tight and $X_n$ converges in distribution to $\mu$ (in Hoffmann-Jørgensen's sense), then $X \sim \mu$ for some $S$-valued random variable $X$ on $(\Omega, \mathcal{A}, P)$. If, in addition, the $X_n$ are measurable and tight, there are $S$-valued random variables $\sim X_n$ and ...
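Since the $X_n$ are arbitrary, possibly non-measurable maps, convergence in distribution here is meant in Hoffmann-Jørgensen's sense; as a reminder (the standard formulation from empirical-process theory, not quoted from this paper), it replaces ordinary expectations with outer expectations:

```latex
% Hoffmann-Jorgensen convergence in distribution for arbitrary maps
% X_n : Omega -> S. The outer expectation E^* is needed because the
% X_n need not be measurable.
\[
  X_n \rightsquigarrow \mu
  \quad\Longleftrightarrow\quad
  \mathbb{E}^{*}\!\bigl[f(X_n)\bigr] \;\longrightarrow\; \int_S f \, d\mu
  \quad \text{for every bounded continuous } f : S \to \mathbb{R}.
\]
```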
Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation
Sutton, Szepesvári and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity scales only linearly in the size of the function approximator. Although their “gradient temporal difference” (GTD) algorithm converges reliably, it can be very slow compared to conventional linear...
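As a rough illustration of the gradient-TD family this snippet refers to, here is a sketch of a TDC-style update ("TD with gradient correction") on linear features; the variable names and constant step sizes are my assumptions, and the published algorithms anneal the two step sizes on separate timescales.

```python
import numpy as np

def tdc_step(theta, w, phi, r, phi_next, alpha=0.01, beta=0.1, gamma=0.99):
    """One TDC-style ('TD with gradient correction') update, a sketch.

    theta: primary weights for the linear value estimate theta @ phi;
    w: secondary weights estimating the expected TD error given features.
    Constant step sizes are an illustrative simplification.
    """
    delta = r + gamma * theta @ phi_next - theta @ phi            # TD error
    theta = theta + alpha * (delta * phi - gamma * (w @ phi) * phi_next)
    w = w + beta * (delta - w @ phi) * phi
    return theta, w
```

Note that each update costs O(n) in the number of features, matching the linear-complexity claim in the snippet.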
Supplementary Appendix to "Incentive Compatibility of Large Centralized Matching Markets"
First, we summarize definitions and related theorems of asymptotic statistics in Section A. We prove the theorems in Section B. Lastly, Section C contains additional simulation results. A. Asymptotic Statistics. We summarize some results of asymptotic statistics from Serfling (1980). Let $X_1, X_2, \ldots$ and $X$ be random variables on a probability space $(\Omega, \mathcal{A}, P)$. We say that $X_n$ converges in probability...
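The sentence is cut off mid-definition; the standard definition it begins (as given in Serfling, 1980) is:

```latex
% Standard definition of convergence in probability, completing the
% truncated sentence above.
\[
  X_n \xrightarrow{\;P\;} X
  \quad\Longleftrightarrow\quad
  \lim_{n \to \infty} P\bigl(|X_n - X| > \varepsilon\bigr) = 0
  \quad \text{for every } \varepsilon > 0 .
\]
```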